Helper function given by Professor

Loading the dataset

Data Cleaning and Preprocessing

In this section, we clean and preprocess the data using various methods.

Finding missing values, entries with a timestamp of 0, and entries where the x, y, and z axes are all zero

It appears that out of 1086466 entries, 12842 contain invalid values that do not make sense, so we will drop those and update our df.

We also see that the z-axis value is suffixed with a semicolon; we need to remove that and convert the column from string to float.
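The cleaning steps above can be sketched as follows. The column names (`timestamp`, `x`, `y`, `z`) and the tiny stand-in frame are assumptions for illustration; the real df is loaded from the raw accelerometer file.

```python
import pandas as pd

# Tiny stand-in frame; column names are assumptions about the raw file's layout.
df = pd.DataFrame({
    "timestamp": [49105962326000, 0, 49106062271000],
    "x": [-0.69, 0.0, 5.01],
    "y": [12.68, 0.0, 11.26],
    "z": ["0.5;", "0.0;", "0.95;"],
})

# The z column carries a trailing semicolon: strip it and cast to float.
df["z"] = df["z"].str.rstrip(";").astype(float)

# Drop rows with a zero timestamp or with all three axes equal to zero.
bad = (df["timestamp"] == 0) | ((df["x"] == 0) & (df["y"] == 0) & (df["z"] == 0))
df = df[~bad].reset_index(drop=True)
```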

The data is essentially clean now. Our next step is some Exploratory Data Analysis (EDA).

Exploratory Data Analysis

In this section, we plot various graphs to examine user-based patterns, activity-based patterns, etc.

Plotting the data of some users using the magnitude of the x, y, and z axes
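The magnitude is the Euclidean norm of the three axes, which makes the signal independent of device orientation. A minimal sketch, using a toy frame with assumed column names:

```python
import numpy as np
import pandas as pd

# Toy frame with assumed column names; the real df holds the cleaned readings.
df = pd.DataFrame({"x": [0.0, 3.0], "y": [0.0, 4.0], "z": [1.0, 12.0]})

# Orientation-independent signal strength: the Euclidean norm of the 3 axes.
df["magnitude"] = np.sqrt(df["x"] ** 2 + df["y"] ** 2 + df["z"] ** 2)
```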

Segmenting the data and feature generation

  1. Splitting the dataset to create small segments
  2. Using each segment to generate 35 time- and frequency-domain features per axis of accelerometer reading
  3. This gives us a total of 105 features (35 × 3 axes: x, y, and z)
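To illustrate step 2, here is a sketch of a per-axis feature extractor. It computes only a handful of the 35 features (the function name, feature choice, and the assumed 20 Hz sampling rate are illustrative assumptions, not the full pipeline):

```python
import numpy as np

def axis_features(x, fs=20):
    """A few illustrative time- and frequency-domain features for one axis.
    The full pipeline computes 35 per axis; fs=20 Hz is an assumption."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return {
        "mean": np.mean(x),                              # time domain
        "std": np.std(x),
        "min": np.min(x),
        "max": np.max(x),
        "energy": np.sum(x ** 2) / len(x),
        "dominant_freq": freqs[np.argmax(spectrum[1:]) + 1],  # skip the DC bin
    }
```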

N.B. I have appended userID and activity type to the dataframe, as they may serve as metadata later when splitting the data.

Generating windows of 30 seconds each with a 50% overlap, i.e. 15 seconds. The output verifies this. It leaves us with a dataset of 3309 entries, which will later be split into training and test sets.
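The windowing logic can be sketched like this. The helper name and the 20 Hz sampling rate are assumptions (at 20 Hz, a 30 s window is 600 samples and the 15 s step is 300 samples):

```python
import numpy as np

def make_windows(signal, window_size, overlap=0.5):
    """Slide a fixed-size window over a signal with fractional overlap."""
    step = int(window_size * (1 - overlap))
    return [signal[start:start + window_size]
            for start in range(0, len(signal) - window_size + 1, step)]

# 30 s windows with 50% (15 s) overlap at an assumed 20 Hz sampling rate
fs = 20
windows = make_windows(np.arange(2000), window_size=30 * fs, overlap=0.5)
```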

Data Splitting

We are going to use leave-one-group-out for splitting the dataset into training and test sets. To maintain consistency with the results the professor showed in his models, I first split the data into two parts, with the first 28 users in the training set and the rest in the test set.

After the split, the training set has 2446 entries and the test set has 863.

Saving the train and test dataframes as CSV for future use, then dropping the userID column, as it is of no use when training the model.
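The split-save-drop sequence can be sketched as follows; the toy feature frame and column names are assumptions standing in for the real 105-feature dataframe:

```python
import pandas as pd

# Toy feature frame; the real one has 105 features plus userID and activity.
features_df = pd.DataFrame({
    "userID": [1, 15, 28, 29, 33],
    "feat_0": [0.1, 0.2, 0.3, 0.4, 0.5],
    "activity": ["Walking", "Jogging", "Sitting", "Walking", "Standing"],
})

# First 28 users go to training, the rest to testing.
train_df = features_df[features_df["userID"] <= 28]
test_df = features_df[features_df["userID"] > 28]

# Persist for future use, then drop userID before model training.
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
X_train = train_df.drop(columns=["userID"])
```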

Model Training

Label Encoding the activities using the following chart

0 - Downstairs

1 - Jogging

2 - Sitting

3 - Standing

4 - Upstairs

5 - Walking
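Assuming scikit-learn's `LabelEncoder` was used, it assigns codes in alphabetical order, which reproduces the chart above:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts labels alphabetically before assigning integer codes,
# so the mapping matches the chart above.
le = LabelEncoder()
le.fit(["Walking", "Jogging", "Upstairs", "Downstairs", "Sitting", "Standing"])
codes = dict(zip(le.classes_, le.transform(le.classes_)))
```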

Logistic Regression

Experiment 1: C=1, training accuracy = 87.5%, testing accuracy = 76.36%
Experiment 2: C=0.5, training accuracy = 87.8%, testing accuracy = 76.7%
Experiment 3: C=1, L1 penalty, saga solver, training accuracy = 84.36%, testing accuracy = 77.28%
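Experiment 3's configuration can be sketched as below. The synthetic data, the scaling step, and `max_iter` are assumptions for a self-contained example; the real run uses the saved feature CSVs:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 105-feature training data.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# C=1 with an L1 penalty requires a solver that supports it, e.g. saga.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, penalty="l1", solver="saga", max_iter=5000),
)
clf.fit(X, y)
```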

Support Vector Machine

Exp 1: Default SVC, training = 80.8%, testing = 75.20%
Exp 2: RBF SVC, C=100, training = 82.85%, testing = 77.52%
Exp 3: RBF SVC, C=10, training = 91.85%, testing = 78.80%
Exp 4: Poly SVC, training = , testing =
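The four configurations above can be sketched as follows (hyperparameters are taken from the log above; the dictionary layout is just for illustration):

```python
from sklearn.svm import SVC

# The four reported SVC configurations.
models = {
    "default":  SVC(),                       # Exp 1: default kernel is RBF, C=1
    "rbf_C100": SVC(kernel="rbf", C=100),    # Exp 2
    "rbf_C10":  SVC(kernel="rbf", C=10),     # Exp 3
    "poly":     SVC(kernel="poly"),          # Exp 4
}
```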

Random Forest Classifier

Exp 1: Default, training = 100%, testing = 87.83% -> Looks like overfitting, so tuning max_depth and min samples per leaf
Exp 2: n_estimators = 500, max_depth = 10, training = 99.67%, testing = 87.25%
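Exp 2's configuration, sketched on synthetic stand-in data (the real run uses the feature CSVs); capping `max_depth` limits tree complexity to curb the overfitting seen in Exp 1:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# More trees, but a capped depth so individual trees cannot memorize the data.
rf = RandomForestClassifier(n_estimators=500, max_depth=10, random_state=0)
rf.fit(X, y)
```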

Extra Credit Part

I will be talking about the Random Forest model, since it gave the best performance. The training accuracy was 100% while the testing accuracy was ~88%. Looking at the testing confusion matrix, there is a lot of confusion between Walking and Jogging, which is somewhat understandable: jogging speed may be low for some people, and walking speed may be high for others.

I plot the results to see whether this is a user-specific problem or a general one.
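One way to check is to count Walking→Jogging confusions per user. The frame below is a hypothetical sketch (column names `true` and `pred`, and the sample values, are assumptions, not the real predictions):

```python
import pandas as pd

# Hypothetical predictions frame: true vs. predicted label per test window.
results = pd.DataFrame({
    "userID": [29, 30, 30, 30, 33],
    "true":   ["Walking", "Walking", "Walking", "Sitting", "Walking"],
    "pred":   ["Walking", "Jogging", "Jogging", "Sitting", "Walking"],
})

# Count Walking -> Jogging confusions per user to spot user-specific problems.
confusions = (results[(results["true"] == "Walking") &
                      (results["pred"] == "Jogging")]
              .groupby("userID").size())
```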

Conclusion

Based on the above data, it looks like User 30 is the problematic entry: all of their 'Walking' entries are misclassified as 'Jogging'. Delving deeper, it also appears that User 30 has no entries for 'Jogging', which makes them an unfit candidate for testing and explains the obvious misclassification. We can try to remove User 30's instances, as they look like an outlier.